Data Engineering

  • Importing libraries
In [ ]:
import pandas as pd
import numpy as np
  • Reading Data from heart.csv file.
In [ ]:
dt = pd.read_csv("heart.csv")
  • Analysing Data
In [ ]:
dt
Out[ ]:
     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0     63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1     37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2     41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3     56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4     57    0   0       120   354    0        1      163      1      0.6      2   0     2       1
...  ...  ...  ..       ...   ...  ...      ...      ...    ...      ...    ...  ..   ...     ...
298   57    0   0       140   241    0        1      123      1      0.2      1   0     3       0
299   45    1   3       110   264    0        1      132      0      1.2      1   0     3       0
300   68    1   0       144   193    1        1      141      0      3.4      1   2     3       0
301   57    1   0       130   131    0        1      115      1      1.2      1   1     3       0
302   57    0   1       130   236    0        0      174      0      0.0      1   1     2       0

303 rows × 14 columns

In [ ]:
dt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
In [ ]:
dt.columns
Out[ ]:
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')
In [ ]:
dt.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
       'exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']
In [ ]:
dt['sex'].value_counts()
Out[ ]:
1    207
0     96
Name: sex, dtype: int64
In [ ]:
dt['chest_pain_type'].value_counts()
Out[ ]:
0    143
2     87
1     50
3     23
Name: chest_pain_type, dtype: int64
In [ ]:
dt['fasting_blood_sugar'].value_counts()
Out[ ]:
0    258
1     45
Name: fasting_blood_sugar, dtype: int64
In [ ]:
dt['rest_ecg'].value_counts()
Out[ ]:
1    152
0    147
2      4
Name: rest_ecg, dtype: int64
In [ ]:
dt['exercise_induced_angina'].value_counts()
Out[ ]:
0    204
1     99
Name: exercise_induced_angina, dtype: int64
In [ ]:
dt['st_slope'].value_counts()
Out[ ]:
2    142
1    140
0     21
Name: st_slope, dtype: int64
In [ ]:
dt['thalassemia'].value_counts()
Out[ ]:
2    166
3    117
1     18
0      2
Name: thalassemia, dtype: int64
  • Converting coded values to human-readable labels.
In [ ]:
# Map numeric codes to readable labels. Assigning the result of Series.replace
# back to the column avoids chained assignment (dt['col'][mask] = value), which
# triggers pandas' SettingWithCopyWarning and may silently write to a copy.
dt['sex'] = dt['sex'].replace({0: 'female', 1: 'male'})

dt['chest_pain_type'] = dt['chest_pain_type'].replace({
    0: 'typical angina', 1: 'atypical angina',
    2: 'non-anginal pain', 3: 'asymptomatic'})

dt['fasting_blood_sugar'] = dt['fasting_blood_sugar'].replace({
    0: 'lower than 120mg/ml', 1: 'greater than 120mg/ml'})

dt['rest_ecg'] = dt['rest_ecg'].replace({
    0: 'normal', 1: 'ST-T wave abnormality', 2: 'left ventricular hypertrophy'})

dt['exercise_induced_angina'] = dt['exercise_induced_angina'].replace({0: 'no', 1: 'yes'})

dt['st_slope'] = dt['st_slope'].replace({0: 'upsloping', 1: 'flat', 2: 'downsloping'})

# Code 0 in thalassemia has no label; it is left unchanged here and the
# affected rows are filtered out further down.
dt['thalassemia'] = dt['thalassemia'].replace({
    1: 'normal', 2: 'fixed defect', 3: 'reversable defect'})
In [ ]:
dt
Out[ ]:
age sex chest_pain_type resting_blood_pressure cholesterol fasting_blood_sugar rest_ecg max_heart_rate_achieved exercise_induced_angina st_depression st_slope num_major_vessels thalassemia target
0 63 male asymptomatic 145 233 greater than 120mg/ml normal 150 no 2.3 upsloping 0 normal 1
1 37 male non-anginal pain 130 250 lower than 120mg/ml ST-T wave abnormality 187 no 3.5 upsloping 0 fixed defect 1
2 41 female atypical angina 130 204 lower than 120mg/ml normal 172 no 1.4 downsloping 0 fixed defect 1
3 56 male atypical angina 120 236 lower than 120mg/ml ST-T wave abnormality 178 no 0.8 downsloping 0 fixed defect 1
4 57 female typical angina 120 354 lower than 120mg/ml ST-T wave abnormality 163 yes 0.6 downsloping 0 fixed defect 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
298 57 female typical angina 140 241 lower than 120mg/ml ST-T wave abnormality 123 yes 0.2 flat 0 reversable defect 0
299 45 male asymptomatic 110 264 lower than 120mg/ml ST-T wave abnormality 132 no 1.2 flat 0 reversable defect 0
300 68 male typical angina 144 193 greater than 120mg/ml ST-T wave abnormality 141 no 3.4 flat 2 reversable defect 0
301 57 male typical angina 130 131 lower than 120mg/ml ST-T wave abnormality 115 yes 1.2 flat 1 reversable defect 0
302 57 female atypical angina 130 236 lower than 120mg/ml normal 174 no 0.0 flat 1 fixed defect 0

303 rows × 14 columns

In [ ]:
dt['thalassemia'].value_counts()
Out[ ]:
fixed defect         166
reversable defect    117
normal                18
0                      2
Name: thalassemia, dtype: int64
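The leftover 0 in the counts above is a code that has no label in the mapping. A minimal sketch of how such codes could be detected before relabelling is shown below; the `codes` series is a small toy stand-in (an assumption, not the real column), and `Series.map` is used here instead of the notebook's approach so that unlabelled codes become NaN.

```python
import pandas as pd

# Toy stand-in for the thalassemia column; the real values come from heart.csv.
codes = pd.Series([1, 2, 3, 2, 0, 3, 0], name="thalassemia")

# Same labels as in the notebook; code 0 deliberately has no entry.
labels = {1: "normal", 2: "fixed defect", 3: "reversable defect"}

# Codes present in the data but missing from the mapping.
unmapped = sorted(set(codes.unique()) - set(labels))
print("Codes without a label:", unmapped)

# Series.map turns every unmapped code into NaN, so dropna removes those rows.
mapped = codes.map(labels)
clean = mapped.dropna()
print(clean.value_counts())
```

Checking `unmapped` first makes the decision to drop rows explicit instead of discovering the stray code afterwards in `value_counts()`.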
In [ ]:
dt.dtypes
Out[ ]:
age                          int64
sex                         object
chest_pain_type             object
resting_blood_pressure       int64
cholesterol                  int64
fasting_blood_sugar         object
rest_ecg                    object
max_heart_rate_achieved      int64
exercise_induced_angina     object
st_depression              float64
st_slope                    object
num_major_vessels            int64
thalassemia                 object
target                       int64
dtype: object
In [ ]:
data = dt[dt['thalassemia'] != 0]
In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 301 entries, 0 to 302
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      301 non-null    int64  
 1   sex                      301 non-null    object 
 2   chest_pain_type          301 non-null    object 
 3   resting_blood_pressure   301 non-null    int64  
 4   cholesterol              301 non-null    int64  
 5   fasting_blood_sugar      301 non-null    object 
 6   rest_ecg                 301 non-null    object 
 7   max_heart_rate_achieved  301 non-null    int64  
 8   exercise_induced_angina  301 non-null    object 
 9   st_depression            301 non-null    float64
 10  st_slope                 301 non-null    object 
 11  num_major_vessels        301 non-null    int64  
 12  thalassemia              301 non-null    object 
 13  target                   301 non-null    int64  
dtypes: float64(1), int64(6), object(7)
memory usage: 35.3+ KB
  • Saving the clean data into CleanHeart.csv file.
In [ ]:
data.to_csv('CleanHeart.csv', index=False)
In [ ]:
df = pd.read_csv('CleanHeart.csv')
In [ ]:
df
Out[ ]:
age sex chest_pain_type resting_blood_pressure cholesterol fasting_blood_sugar rest_ecg max_heart_rate_achieved exercise_induced_angina st_depression st_slope num_major_vessels thalassemia target
0 63 male asymptomatic 145 233 greater than 120mg/ml normal 150 no 2.3 upsloping 0 normal 1
1 37 male non-anginal pain 130 250 lower than 120mg/ml ST-T wave abnormality 187 no 3.5 upsloping 0 fixed defect 1
2 41 female atypical angina 130 204 lower than 120mg/ml normal 172 no 1.4 downsloping 0 fixed defect 1
3 56 male atypical angina 120 236 lower than 120mg/ml ST-T wave abnormality 178 no 0.8 downsloping 0 fixed defect 1
4 57 female typical angina 120 354 lower than 120mg/ml ST-T wave abnormality 163 yes 0.6 downsloping 0 fixed defect 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
296 57 female typical angina 140 241 lower than 120mg/ml ST-T wave abnormality 123 yes 0.2 flat 0 reversable defect 0
297 45 male asymptomatic 110 264 lower than 120mg/ml ST-T wave abnormality 132 no 1.2 flat 0 reversable defect 0
298 68 male typical angina 144 193 greater than 120mg/ml ST-T wave abnormality 141 no 3.4 flat 2 reversable defect 0
299 57 male typical angina 130 131 lower than 120mg/ml ST-T wave abnormality 115 yes 1.2 flat 1 reversable defect 0
300 57 female atypical angina 130 236 lower than 120mg/ml normal 174 no 0.0 flat 1 fixed defect 0

301 rows × 14 columns

Creating a pandas-profiling report for EDA.

In [ ]:
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
In [ ]:
from pandas_profiling import ProfileReport
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
In [ ]:
profile = ProfileReport(df , title ='Pandas Profiling',explorative=True)
In [ ]:
profile.to_widgets()
/usr/local/lib/python3.6/dist-packages/pandas_profiling/profile_report.py:397: UserWarning: Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60).As an alternative, you can use the HTML report. See the documentation for more information.
  "Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60)."


  • Getting an HTML version of the EDA report
In [ ]:
profile.to_file("heart_report.html")


Model Creation

  • Importing Libraries for Model Creation.
In [ ]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
In [ ]:
df.columns
Out[ ]:
Index(['age', 'sex', 'chest_pain_type', 'resting_blood_pressure',
       'cholesterol', 'fasting_blood_sugar', 'rest_ecg',
       'max_heart_rate_achieved', 'exercise_induced_angina', 'st_depression',
       'st_slope', 'num_major_vessels', 'thalassemia', 'target'],
      dtype='object')
  • Splitting data into feature and target data
In [ ]:
feature_data = df.drop(columns=['target'])     
target_data = df.target
In [ ]:
feature_data = data.drop(columns=['target'])     
target_data = data.target
  • Splitting Categorical and Numerical Data.
In [ ]:
cat_data = feature_data.select_dtypes(include=['object'])
print(cat_data.columns)
# 'number' selects every numeric dtype regardless of platform integer width
num_data = feature_data.select_dtypes(include=['number'])
print(num_data.columns)
Index(['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg',
       'exercise_induced_angina', 'st_slope', 'thalassemia'],
      dtype='object')
Index(['age', 'resting_blood_pressure', 'cholesterol',
       'max_heart_rate_achieved', 'st_depression', 'num_major_vessels'],
      dtype='object')
  • Applying OrdinalEncoder for Categorical data and StandardScaler for Numerical data.
In [ ]:
oe = OrdinalEncoder()
oe.fit(cat_data)
ss = StandardScaler()
ss.fit(num_data)
cat = pd.DataFrame(data = oe.transform(cat_data) , columns=cat_data.columns)
num = pd.DataFrame(data = ss.transform(num_data), columns= num_data.columns)
  • Creating Pipelines for each kind of data.
In [ ]:
cat_pipeline = make_pipeline(OrdinalEncoder())
num_pipeline = make_pipeline(StandardScaler())
  • Creating a Preprocessor pipeline by combining other pipelines.
In [ ]:
preprocessor = make_column_transformer(
              (cat_pipeline,cat_data.columns),
              (num_pipeline,num_data.columns)
)
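To make the effect of the combined preprocessor concrete, here is a small sketch on a two-column toy frame. The frame `toy` and the name `toy_preprocessor` are made up for illustration; only the scikit-learn calls mirror the cells above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data (hypothetical): one categorical and one numerical column.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "age": [40, 50, 60, 70],
})

# Same structure as the notebook's preprocessor, applied to the toy columns.
toy_preprocessor = make_column_transformer(
    (make_pipeline(OrdinalEncoder()), ["sex"]),
    (make_pipeline(StandardScaler()), ["age"]),
)

# Output columns follow the order of the transformers: encoded sex, scaled age.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed)
```

OrdinalEncoder assigns integer codes in sorted category order, so "female" becomes 0 and "male" becomes 1, while StandardScaler centres the age column on zero with unit variance.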
  • Splitting data into test data and train data.
In [ ]:
from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(feature_data, target_data)
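Because the split above passes no `random_state`, every run shuffles the rows differently, which is one likely reason the scores further down differ between runs. The sketch below shows how a fixed seed and stratification could make the split repeatable; the toy data and the seed value 42 are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 rows with a perfectly balanced binary target.
toy_X = pd.DataFrame({"age": range(40, 60)})
toy_y = pd.Series([0, 1] * 10, name="target")

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
split_a = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)

# The two calls return the same test rows because the seed is identical.
same_rows = split_a[1].index.equals(split_b[1].index)
print("Identical test rows:", same_rows)
```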
  • Importing the LogisticRegression class
In [ ]:
from sklearn.linear_model import LogisticRegression
  • Creating Model
In [ ]:
pipeline = make_pipeline(preprocessor, LogisticRegression())
  • Fitting Model with train data and getting the scores for both test and train data.
In [ ]:
pipeline.fit( trainX , trainY)
print("Training Score : ",pipeline.score(trainX, trainY))
print('Testing Score : ',pipeline.score(testX, testY))
Training Score :  0.8488888888888889
Testing Score :  0.8026315789473685
In [ ]:
pipeline.fit( trainX , trainY)
print("Training Score : ",pipeline.score(trainX, trainY))
print('Testing Score : ',pipeline.score(testX, testY))
Training Score :  0.8355555555555556
Testing Score :  0.88
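The two runs above report noticeably different testing scores, which suggests that a single train/test split is a noisy estimate on a dataset this small. One possible way to get a steadier number is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the heart data) and a plain scaler plus LogisticRegression pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 rows, 6 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())

# Five folds: each row is used for testing exactly once.
scores = cross_val_score(model, X, y, cv=5)
mean_score, std_score = scores.mean(), scores.std()
print("Fold accuracies:", scores)
print("Mean accuracy:", mean_score, "Std:", std_score)
```

Reporting the mean together with the spread across folds gives a better sense of how much a single score such as 0.80 or 0.88 can move by chance.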
  • Importing the RandomForestClassifier class
In [ ]:
from sklearn.ensemble import RandomForestClassifier
  • Creating Model
In [ ]:
rf_pipeline = make_pipeline( preprocessor , RandomForestClassifier( n_estimators= 100 ))
  • Fitting Model with train data and getting the scores for both test and train data.
In [ ]:
rf_pipeline.fit( trainX , trainY)
print("Training Score : ",rf_pipeline.score(trainX, trainY))
print('Testing Score : ',rf_pipeline.score(testX, testY))
Training Score :  1.0
Testing Score :  0.868421052631579
In [ ]:
rf_pipeline.fit( trainX , trainY)
print("Training Score : ",rf_pipeline.score(trainX, trainY))
print('Testing Score : ',rf_pipeline.score(testX, testY))
Training Score :  1.0
Testing Score :  0.7866666666666666
  • Importing the GridSearchCV class
In [ ]:
from sklearn.model_selection import GridSearchCV
  • Creating Pipeline
In [ ]:
gs_pipeline = make_pipeline(preprocessor, RandomForestClassifier(n_estimators=100))
  • Parameter tuning
In [ ]:
params = {'randomforestclassifier__n_estimators':[100,200,250],'randomforestclassifier__criterion':['gini','entropy'], 'randomforestclassifier__max_depth':[5,10,15]}
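The `randomforestclassifier__` prefix in the grid keys comes from how `make_pipeline` names its steps: each step is named after its class in lowercase, and a double underscore separates the step name from the parameter. The sketch below lists those names for a simplified pipeline; `demo_pipeline` is a hypothetical stand-in that uses a StandardScaler instead of the notebook's preprocessor.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for gs_pipeline (the real one starts with the preprocessor).
demo_pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=100))

# Step names are generated automatically from the lowercased class names.
step_names = list(demo_pipeline.named_steps)
print(step_names)

# Every key that GridSearchCV accepts for the forest starts with this prefix.
tunable = [name for name in demo_pipeline.get_params()
           if name.startswith("randomforestclassifier__")]
print(tunable)
```

Printing `get_params()` keys this way is a quick check that a grid key is spelled correctly before starting a long search.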
  • Creating Model with parameter tuning
In [ ]:
gs = GridSearchCV(gs_pipeline, param_grid=params, cv=5, n_jobs=4)
  • Fitting Model with train data and getting the scores for both test and train data.
In [ ]:
gs.fit( trainX , trainY)
print("Training Score : ",gs.score(trainX, trainY))
print('Testing Score : ',gs.score(testX, testY))
print('******************************')
print('Best params :',gs.best_params_)
print('Best Score :', gs.best_score_ )
Training Score :  0.9555555555555556
Testing Score :  0.8552631578947368
******************************
Best params : {'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 200}
Best Score : 0.8222222222222223
In [ ]:
gs.fit( trainX , trainY)
print("Training Score : ",gs.score(trainX, trainY))
print('Testing Score : ',gs.score(testX, testY))
print('******************************')
print('Best params :',gs.best_params_)
print('Best Score :', gs.best_score_ )
Training Score :  0.96
Testing Score :  0.84
******************************
Best params : {'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 200}
Best Score : 0.8399999999999999

Saving Model

Importing the joblib library to save each fitted model to a file for future predictions.

In [ ]:
# sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23;
# import the standalone joblib package instead (pip install joblib).
import joblib
In [ ]:
joblib.dump(pipeline, 'lm_88.joblib') 
Out[ ]:
['lm_88.joblib']
In [ ]:
joblib.dump(rf_pipeline, 'rf_n100_86.pkl') 
Out[ ]:
['rf_n100_86.pkl']
In [ ]:
joblib.dump(gs, 'rf_n200_md5_gini_85.pkl') 
Out[ ]:
['rf_n200_md5_gini_85.pkl']
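As a quick sanity check on the saved files, a dumped model can be loaded again and compared with the in-memory one. The sketch below does this round trip with a tiny LogisticRegression on made-up data and a temporary directory; the file name `demo_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, only to obtain a fitted estimator.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Dump to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Round trip preserved predictions:", same_predictions)
```

Files written by joblib are generally only safe to load with the same scikit-learn version that produced them, so it can help to note the version alongside the file.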
  • testX is the testing data, on which the model was not trained; it is saved here for use in future presentations.
In [ ]:
testX
Out[ ]:
age sex chest_pain_type resting_blood_pressure cholesterol fasting_blood_sugar rest_ecg max_heart_rate_achieved exercise_induced_angina st_depression st_slope num_major_vessels thalassemia target
36 54 female non-anginal pain 135 304 greater than 120mg/ml ST-T wave abnormality 170 no 0.0 downsloping 0 fixed defect 1
173 60 male typical angina 130 206 lower than 120mg/ml normal 132 yes 2.4 flat 2 reversable defect 0
97 43 male non-anginal pain 130 315 lower than 120mg/ml ST-T wave abnormality 162 no 1.9 downsloping 1 fixed defect 1
220 55 male typical angina 140 217 lower than 120mg/ml ST-T wave abnormality 111 yes 5.6 upsloping 0 reversable defect 0
234 51 male typical angina 140 299 lower than 120mg/ml ST-T wave abnormality 173 yes 1.6 downsloping 0 reversable defect 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28 65 female non-anginal pain 140 417 greater than 120mg/ml normal 157 no 0.8 downsloping 1 fixed defect 1
138 64 male typical angina 128 263 lower than 120mg/ml ST-T wave abnormality 105 yes 0.2 flat 1 reversable defect 1
262 63 female typical angina 108 269 lower than 120mg/ml ST-T wave abnormality 169 yes 1.8 flat 2 fixed defect 0
232 64 male typical angina 120 246 lower than 120mg/ml normal 96 yes 2.2 upsloping 1 fixed defect 0
100 59 male asymptomatic 178 270 lower than 120mg/ml normal 145 no 4.2 upsloping 0 reversable defect 1

75 rows × 14 columns

In [ ]:
# assign returns a new DataFrame, so nothing is written into a slice of
# feature_data and no SettingWithCopyWarning is raised.
testX = testX.assign(target=testY)
  • Saving the testX data as HeartTestData1.csv file
In [ ]:
testX.to_csv('HeartTestData1.csv', index=False)

Prediction

  • Predicting Output for a single row
In [ ]:
pipeline.predict(pd.DataFrame([[49, 'male', 'non-anginal pain', 109, 102, 'lower than 120mg/ml', 'ST-T wave abnormality', 138, 'no', 2.2, 'flat', 2, 'fixed defect']] , columns=['age', 'sex', 'chest_pain_type', 'resting_blood_pressure','cholesterol', 'fasting_blood_sugar', 'rest_ecg','max_heart_rate_achieved', 'exercise_induced_angina', 'st_depression','st_slope', 'num_major_vessels', 'thalassemia']))
Out[ ]:
array([1])
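`predict` returns only the class label. When the confidence behind a prediction matters, `predict_proba` on the same pipeline returns one probability per class. The sketch below shows the call on a deliberately tiny toy pipeline; the training rows, the two-column schema and the name `demo` are assumptions for illustration and not the notebook's model.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up training rows with a reduced set of columns.
train = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "age": [63, 37, 41, 56, 57, 45],
    "target": [1, 0, 0, 1, 1, 0],
})

demo = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["sex"]), (StandardScaler(), ["age"])),
    LogisticRegression(),
)
demo.fit(train[["sex", "age"]], train["target"])

# A single row must still be passed as a DataFrame with the training column names.
row = pd.DataFrame([["male", 49]], columns=["sex", "age"])
label = demo.predict(row)
proba = demo.predict_proba(row)  # shape (1, 2): P(target=0), P(target=1)
print(label, proba)
```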
  • Loading the Model file.
  • Predicting the output for all the rows in the testX.
In [ ]:
from joblib import load
model = load('lm_88.joblib')  # same file name that joblib.dump wrote above
pred = model.predict(testX.drop(columns='target'))
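In [ ]:
# Collect predictions and true labels side by side. testX keeps the shuffled
# row labels of the original frame, so the target is copied by position
# (to_numpy) instead of being aligned on the index.
dc = pd.DataFrame()
dc['pred'] = pred
dc['target'] = testX['target'].to_numpy()
dc.target.value_counts()
Out[ ]:
0    38
1    37
Name: target, dtype: int64
    • The output above shows that the test data contains 38 people without heart disease and 37 people with heart disease.
In [ ]:
dc[ dc.pred == dc.target ]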
Out[ ]:
pred target
0 1 1
1 0 0
2 1 1
3 0 0
4 0 0
... ... ...
66 1 1
67 1 1
69 0 0
70 1 1
73 0 0

66 rows × 2 columns

  • The output above shows that 66 of the 75 predictions are correct.
In [ ]:
dc[ dc.pred != dc.target ]
Out[ ]:
pred target
10 1 0
11 1 0
37 0 1
44 1 0
53 1 0
68 1 0
71 0 1
72 1 0
74 0 1
  • The output above shows that 9 of the 75 predictions are wrong.

Accuracy

In [7]:
print('Model Accuracy :( No. of right predictions / Total no. of rows in data ) * 100  ')
print('Model Accuracy : ', (66 / 75) * 100 , '%')
Model Accuracy :( No. of right predictions / Total no. of rows in data ) * 100  
Model Accuracy :  88.0 %
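The same figure can be obtained without typing the counts by hand. The sketch below rebuilds label arrays that match the counts reported above (38 negatives, 37 positives, and from the mismatch table 6 rows with pred 1 and target 0 plus 3 rows with pred 0 and target 1) and passes them to scikit-learn's metric functions. The arrays are reconstructed for illustration only; in the notebook itself `dc.target` and `dc.pred` would be passed instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Reconstructed from the reported counts; row order is arbitrary.
y_true = np.array([0] * 38 + [1] * 37)
y_pred = np.array([0] * 32 + [1] * 6 + [0] * 3 + [1] * 34)

# Fraction of rows where the prediction equals the target.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

print("Model Accuracy : ", accuracy * 100, "%")
print(matrix)
```

The confusion matrix adds detail that the single accuracy number hides: here the model produces twice as many false positives as false negatives.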